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Abstract 

Numerous machine learning algorithms contain pairwise statistical problems at their core— 
that is, tasks that require computations over all pairs of input points if implemented naively. 
Often, tree structures are used to solve these problems efficiently. Dual-tree algorithms can 
efficiently solve or approximate many of these problems. Using cover trees, rigorous worst- 
case runtime guarantees have been proven for some of these algorithms. In this paper, 
we present a problem-independent runtime guarantee for any dual-tree algorithm using the 
cover tree, separating out the problem-dependent and the problem-independent elements. 
This allows us to just plug in bounds for the problem-dependent elements to get runtime 
guarantees for dual-tree algorithms for any pairwise statistical problem without re-deriving 
the entire proof. We demonstrate this plug-and-play procedure for nearest-neighbor search 
and approximate kernel density estimation to get improved runtime guarantees. Under 
mild assumptions, we also present the first linear runtime guarantee for dual-tree based 
range search. 

Keywords: dual-tree algorithms, branch and bound, nearest neighbor search, kernel 

density estimation, range search 


1. Dual-tree algorithms 

A surprising number of machine learning algorithms have computational bottlenecks that 
can be expressed as pairwise statistical problems. By this, we mean computational tasks 
that can be evaluated directly by iterating over all pairs of input points. Nearest neighbor 
search is one such problem, since for every query point, we can evaluate its distance to 
every reference point and keep the closest one. This naively requires 0(N ) time to answer 
in single query in a reference set of size A; answering 0(N ) queries subsequently requires 
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prohibitive 0(N 2 ) time. Kernel density estimation is also a pairwise statistical problem, 
since we compute a sum over all reference points for each query point. This again requires 
0(N 2 ) time to answer O(N) queries if done directly. The reference set is typically indexed 
with spatial data structures to accelerate this type of computation ( 'inkel and Bentley, 
1974; Beygelzimer et ah, 200 >); these result in O(loglV) runtime per query under favorable 
conditions. 

Building upon this intuition, and Moore (2001) generalized the fast multipole 

method from computational physics to obtain dual-tree algorithms. These are extremely 
useful when there are large query sets, not just a few query points. Instead of building a tree 
on the reference set and searching with each query point separately, Gray and Moore suggest 
also building a query tree and traversing both the query and reference trees simultaneously 
(a dual-tree traversal, from which the class of algorithms takes its name). 

Dual-tree algorithms can be easily understood through the recent framework of urtin et al. 
(2013b): two trees (a query tree and a reference tree) are traversed by a pruning dual-tree 
traversal. This traversal visits combinations of nodes from the trees in some sequence (each 
combination consisting of a query node and a reference node), calling a problem-specific 
Score () function to determine if the node combination can be pruned. If not, then a 
problem-specific BaseCaseO function is called for each combination of points held in the 
query node and reference node. This has significant similarity to the more common single¬ 
tree branch-and-bound algorithms, except that the algorithm must recurse into child nodes 
of both the query tree and reference tree. 

There exist numerous dual-tree algorithms for problems as diverse as kernel density 
estimation (Gray and Moore, 2003), mean shift (Wang et ah, 2007), minimum spanning 
tree calculation (I , 0), n-point correlation function estimation (March et ah, 

2012), max-kernel search (Curtin et ah, 2013c), particle smoothing (Klaas et , 2006), 
variational inference (Amizadeh et ah, 2012), range search (Gray and Moore, 2001), and 
embedding techniques Van Der Maaten (2014), to name a few. 

Some of these algorithms are derived using the cover tree ( ygelzimer et ah, 2006), a 
data structure with compelling theoretical qualities. When cover trees are used, Dual-tree 
all-nearest-neighbor search and approximate kernel density estimation have 0(N ) runtime 
guarantees for O(N) queries (Ram et ah, 2009a); minimum spanning tree calculation scales 
as 0(IV log IV) (March et ah, 2010). Other problems have similar worst-case guarantees 
(Curtin and Ram, 2014; March, 2013). 

In this work we combine the generalization of 'urtin ct al. (20131) with the theoretical 
results of 3eygelzimer et ah (2006) and others in order to develop a worst-case runtime 
bound for any dual-tree algorithm when the cover tree is used. 

Section 2 lays out the required background, notation, and introduces the cover tree and 
its associated theoretical properties. Readers familiar with the cover tree literature and 
dual-tree algorithms (especially Curtin et ah, 2013b) may find that section to be review. 
Following that, we introduce an intuitive measure of cover tree imbalance, an important 
property for understanding the runtime of dual-tree algorithms, in Section 3. This measure 
of imbalance is then used to prove the main result of the paper in Section 4, which is a 
worst-case runtime bound for generalized dual-tree algorithms. We apply this result to 
three specific problems: nearest neighbor search (Section 5), approximate kernel density 
estimation (Section 6), and range search / range count (Section 7), showing linear runtime 
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Symbol 

Description 

jV 

A tree node 


Set of child nodes of 


Set of points held in ,/b 


Set of descendant nodes of jVi 

m 

Set of points contained in jVi and 

Pi 

Center of jYi (for cover trees, = pt) 

N 

Furthest descendant distance 


Table 1: Notation for trees. See Curtin et al. (2013b) for details. 


bounds for each of those algorithms. Each of these bounds is an improvement on the 
state-of-the-art, and in the case of range search, is the first such bound. 

2. Preliminaries 

For simplicity, the algorithms considered in this paper will be presented in a tree-independent 
context, as in 'urtin et al. (20131 ), but the only type of tree we will consider is the cover 
tree (Beygelzimer et al., 2006), and the only type of traversal we will consider is the cover 
tree pruning dual-tree traversal, which we will describe later. 

As we will be making heavy use of trees, we must establish notation (taken from 
Curtin et ah, 2013b). The notation we will be using is defined in Table 1. 

2.1 The cover tree 

The cover tree is a leveled hierarchical data structure originally proposed for the task of 
nearest neighbor search by Beygelzimer et al. (2006). Each node jVi in the cover tree is 
associated with a single point p^. An adequate description is given in their work (we have 
adapted notation slightly): 

A cover tree ^ on a dataset S is a leveled tree where each level is a “cover” for 
the level beneath it. Each level is indexed by an integer scale Si which decreases 
as the tree is descended. Every node in the tree is associated with a point in S. 

Each point in S may be associated with multiple nodes in the tree; however, we 
require that any point appears at most once in every level. Let C Si denote the 
set of points in S associated with the nodes at level Si. The cover tree obeys 
the following invariants for all sp. 

• (Nesting). C Si C C Si -\. This implies that once a point p £ S appears in 
C Si then every lower level in the tree has a node associated with p. 

• (Covering tree). For every pi £ C Si - 1 , there exists a pj £ C Si such that 
d(pi,pj) < 2 Si and the node in level s* associated with pj is a parent of the 
node in level Si — 1 associated with pj. 

• (Separation). For all distinct Pi,pj £ C Si , d(pi,pj ) > 2 Si . 
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As a consequence of this definition, if there exists a node ./f(, containing the point pi 
at some scale Sj, then there will also exist a self-child node JVI C containing the point pi at 
scale Si — 1 which is a child of .A). In addition, every descendant point of the node <,4^ is 
contained within a ball of radius 2 Si+l centered at the point pp, therefore, A* = 2 Si+1 and 
Pi = pi (Table 1). 

Note that the cover tree may be interpreted as an infinite-leveled tree, with C 0 0 contain¬ 
ing only the root point, C-oo = S , and all levels between defined as above. Beygelzimer et al. 
(2006) find this representation (which they call the implicit representation) easier for de¬ 
scription of their algorithms and some of their proofs. But clearly, this is not suitable for 
implementation; hence, there is an explicit representation in which all nodes that have only 
a self-child are coalesced upwards (that is, the node’s self-child is removed, and the children 
of that self-child are taken to be the children of the node). 

In this work, we consider only the explicit representation of a cover tree, and do not 
concern ourselves with the details of tree construction 1 . 

2.2 Expansion constant 

The explicit representation of a cover tree has a number of useful theoretical properties 
based on the expansion constant (Kargcr and Rulil, 200 ); we restate its definition below. 

Definition 1 Let Bs(p, A) be the set of points in S within a closed ball of radius A around 
some p £ S with respect to a metric d: Bs(p,A) = {r £ S: d(p,r ) < A}. Then, the 
expansion constant of S with respect to the metric d is the smallest c > 2 such that 

\B S (P,2A)\ <c\B s (p,A)\ VpeS, V A > 0. (1) 

The expansion constant is used heavily in the cover tree literature. It is, in some sense, 
a notion of instrinic dimensionality, and previous work has shown that there are many 
scenarios where c is independent of the number of points in the dataset (Karger and Ruh , 
2002; Beygelzimer et al., 2006; Krauthgamer and Lee, 2004; Ram et al., 2009a). Note also 
that if points in S C Ti are being drawn according to a stationary distribution f(x), then c 
will converge to some finite value Cf as 151 —» oo. To see this, define Cf as a generalization 
of the expansion constant for distributions. Cf > 2 is the smallest value such that 

[ f{x)dx <Cf[ f(x)dx (2) 

JBhIp, 2 A) JB H (pA) 

for all p £ Ti and A > 0 such that Jg n ( p A ) f{x)dx > 0, and with Bpipp, A) defined as the 
closed ball of radius A in the space Ti. 

As a simple example, take f(x) as a uniform spherical distribution in 1Z d : for any |x| < 1, 
f(x) is a constant; for \x\ > 1, f(x) = 0. It is easy to see that c/ in this situation is 2 d , and 
thus for some dataset S, c must converge to that value as more and more points are added 
to S. Closed-form solutions for Cf for more complex distributions are less easy to derive; 
however, empirical speedup results from Beygelzimer et al. (2006) suggest the existence of 
datasets where c is not strongly dependent on d. For instance, the covtype dataset has 

1. A batch construction algorithm is given by Beygelzimer et al. (2006), called Construct. 
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54 dimensions but the expansion constant is much smaller than other, lower-dimensional 
datasets. 

There are some other important observations about the behavior of c. Adding a single 
point to S may increase c arbitrarily: consider a set S distributed entirely on the surface 
of a unit hypersphere. If one adds a single point at the origin, producing the set S', then 
c explodes to |S' , | whereas before it may have been much smaller than 151. Adding a 
single point may also decrease c significantly. Suppose one adds a point arbitrarily close 
to the origin to S'] now, the expansion constant will be |5 , |/2. Both of these situations 
are degenerate cases not commonly encountered in real-world behavior; we discuss them in 
order to point out that although we can bound the behavior of c as |Sj —> oo for S from a 
stationary distribution, we are not able to easily say much about its convergence behavior. 

The expansion constant can be used to show a few useful bounds on various properties 
of the cover tree; we restate these results below, given some cover tree built on a dataset S 
with expansion constant c and (S' = N: 

• Width bound: no cover tree node has more than c 4 children (Lemma 4.1, 3eygelzimer et al. 

(2006)). 

• Depth bound: the maximum depth of any node is 0(c 2 log N) (Lemma 4.3, Beygelzimer et al. 

(2006)). 

• Space bound: a cover tree has 0(N ) nodes (Theorem 1, Beygelzimer et al. (2006)). 

Lastly, we introduce a convenience lemma of our own which is a generalization of the 
packing arguments used by Beygelzimer et al. (2006). This is a more flexible version of 
their argument. 

Lemma 1 Consider a dataset S with expansion constant c and a subset C C S such that 
every point in C is separated by 5. Then, for any point p (which may or may not be in S), 
and any radius p5 > 0: 

\B s (p,pS)nC\ < c 2 +r io g 2 p! . ( 3 ) 


Proof The proof is based on the packing argument from Lemma 4.1 in Beygelzimer et al. 
(2006). Consider two cases: first, let d(p,pi ) > p5 for any pi £ S. In this case, Bgfp, p5) = 0 
and the lemma holds trivially. Otherwise, let pi £ S be a point such that d(p,pi) < p5. 
Observe that B s (p,p5) C B s (pi,2p5). Also, \B s (pi, 2p5)\ = c 2+ f log 2 p\\B s (jpi, <5/2)1 by the 
definition of the expansion constant. Because each point in C is separated by 5, the number 
of points in Bg(p, p5 ) fl C is bounded by the number of disjoint balls of radius 5/2 that can 
be packed into Bg(p,p5). In the worst case, this packing is perfect, and 


I B s (p,p5)\ < 


\B s (pi, 2p5)\ 
\B s (Pi,5/2)\ 


< c 2+flog 2 pl _ 


( 4 ) 
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(a) Balanced cover tree. (b) Imbalanced cover tree. 

Figure 1: Balanced and imbalanced cover trees. 


3. Tree imbalance 

It is well-known that imbalance in trees leads to degradation in performance; for instance, 
a fed-tree node with every descendant in its left child except one is effectively useless. A 
fed-tree full of nodes like this will perform abysmally for nearest neighbor search, and it is 
not hard to generate a pathological dataset that will cause a fed-tree of this sort. 

This sort of imbalance applies to all types of trees, not just fed-trees. In our situation, we 
are interested in a better understanding of this imbalance for cover trees, and thus endeavor 
to introduce a more formal measure of imbalance which is correlated with tree performance. 
Numerous measures of tree imbalance have already been established; one example is that 
proposed by Colless (1982), and another is Sackin’s index (Sackin, 1972), but we aim to 
capture a different measure of imbalance that utilizes the leveled structure of the cover tree. 

We already know each node in a cover tree is indexed with an integer level (or scale). 
In the explicit representation of the cover tree, each non-leaf node has children at a lower 
level. But these children need not be strictly one level lower; see Figure 1. In Figure la, 
each cover tree node has children that are strictly one level lower; we will refer to this as 
a perfectly balanced cover tree. Figure lb, on the other hand, contains the node <jV m which 
has two children with scale two less than s m . We will refer to this as an imbalanced cover 
tree. Note that in our definition, the balance of a cover tree has nothing to do with differing 
number of descendants in each child branch but instead only missing levels. 

An imbalanced cover tree can happen in practice, and in the worst cases, the imbalance 
may be far worse than the simple graphs of Figure 1. Consider a dataset with a single 
outlier which is very far away from all of the other points 2 . Figure 2 shows what happens in 
this situation: the root node has two children; one of these children has only the outlier as 
a descendant, and the other child has the rest of the points in the dataset as a descendant. 
In fact, it is easy to find datasets with a handful of outliers that give rise to a chain-like 
structure at the top of the tree: see Figure 3 for an illustration 3 . 


2. Note also that for an outlier sufficiently far away, the expansion constant is N — 1, so we should expect 
poor performance with the cover tree anyway. 

3. As a side note, this behavior is not limited to cover trees, and can happen to mean-split fed-trees too, 
especially in higher dimensions. 
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Figure 2: Single-outlier cover tree. 



Figure 3: A multiple-outlier cover tree. 


A tree that has this chain-like structure all the way down, which is similar to the kd- tree 
example at the beginning of this section, is going to perform horrendously; motivated by 
this observation, we define a measure of tree imbalance. 

Definition 2 The cover node imbalance i n {^K) for a cover tree node jVi with scale Si in 
the cover tree 22 is defined as the cumulative number of missing levels between the node 
and its parent jV p (which has scale s p ). If the node is a leaf child (that is, Si = —oo), then 
number of missing levels is defined as the difference between s p and s m i n — 1 where s m in is 
the smallest scale of a non-leaf node in 2?. If is the root of the tree, then the cover node 
imbalance is 0. Explicitly written, this calculation is 


In (^0 


s p — Si — 1 if is not a leaf and not the root node 

< max(s p — s m i n — 1, 0) if is a leaf 

0 if is the root node. 


( 5 ) 


This simple definition of cover node imbalance is easy to calculate, and using it, we can 
generalize to a measure of imbalance for the full tree. 


Definition 3 The cover tree imbalance it{22) for a cover tree 22 is defined as the cumulative 
number of missing levels in the tree. This can be expressed as a function of cover node 
imbalances easily: 


h{&) = 22 in ^%)' 


( 6 ) 
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A perfectly balanced cover tree ^ with no missing levels has imbalance = 0 (for 

instance, Figure la). A worst-case cover tree 3T W which is entirely a chain-like structure with 
maximum scale s max and minimum scale s m i n will have imbalance it(^w) ~ N(_s max — S min ). 
Because of this chain-like structure, each level has only one node and thus there are at least 
N levels; or, s max — s m in > N, meaning that in the worst case the imbalance is quadratic 
in N 4 . 

However, for most real-world datasets with the cover tree implementation in mlpack 
(Curtin et al., 2013a) and the reference implementation (Beygelzimer et ah, 2006), the tree 
imbalance is near-linear with the number of points. Generally, most of the cover tree 
imbalance is contributed by leaf nodes whose parent has scale greater than s m i n . At this 
time, no cover tree construction algorithm specifically aims to minimize imbalance. 

4. General runtime bound 

Perhaps more interesting than measures of tree imbalance is the way cover trees are actu¬ 
ally used in dual-tree algorithms. Although cover trees were originally intended for nearest 
neighbor search (See Algorithm Find-All-Nearest, Beygelzimer et al., 2006), they can be 
adapted to a wide variety of problems: minimum spanning tree calculation (March et ah, 
2010), approximate nearest neighbor search ( am cl , 2009b), Gaussian processes poste¬ 
rior calculation (Moore and Russell, 2014), and max-kernel search (Curtin and Ram, 2014) 
are some examples. Further, through the tree-independent dual-tree algorithm abstraction 
of Curtin et a ( 13b), other existing dual-tree algorithms can easily be adapted for use 

with cover trees. 

In the framework of tree-independent dual-tree algorithms, all that is necessary to de¬ 
scribe a dual-tree algorithm is a point-to-point base case function (BaseCaseO) and a 
node-to-node pruning rule (Score()). These functions, which are often very straightfor¬ 
ward, are then paired with a type of tree and a pruning dual-tree traversal to produce a 
working algorithm. In later sections, we will consider specific examples. 

When using cover trees, the typical pruning dual-tree traversal is an adapted form of 
the original nearest neighbor search algorithm (see Find-All-Nearest, Beygelzimer et al., 
2006); this traversal is implemented in both the cover tree reference implementation and in 
the more flexible mlpack library ( Jurtin et al., 2013 i). The problem-independent traversal 
is given in Algorithm 1 and was originally presented by Curtin and Ram (: ). Initially, 

it is called with the root of the query tree and a reference set R containing only the root of 
the reference tree. 

This dual-tree recursion is a depth-first recursion in the query tree and a breadth-first 
recursion in the reference tree; to this end, the recursion maintains one query node jVq and 
a reference set R. The set R may contain reference nodes with many different scales; the 
maximum scale in the reference set is s™ ax (line 3). Each single recursion will descend either 
the query tree or the reference tree, not both; the conditional in line 4, which determines 
whether the query or reference tree will be recursed, is aimed at keeping the relative scales 
of query nodes and reference nodes close. 

A query recursion (lines 13-18) is straightforward: for each child of the node 
combinations (M^ c , jV r ) are scored for each ,yV r in the reference set R. If possible, these 


4. Note that in this situation, c ~ N also. 



Algorithm 1 The standard pruning dual-tree traversal for cover trees. 
1: Input: query node jV q , set of reference nodes R 

2: Output: none 

3: s” ax -5- max^ reK s r 
4: if (sg < s™ ax ) then 
5: {Perform a reference recursion.} 

6: for each JF r £ R do 

7: BaseCase(p g , p r ) 

8: end for 

9: R r £- {jV r £ R : S r = S™ ax } 

10: R r —1 ■ <SK 6 Rr} U (R\ R r ) 

11: R'r-i {^K G Rr—i ■ Score ,^Y r ) oo} 

12: recurse with jV q and R' r _i 

13: else 

14: {Perform a query recursion.} 

15: for each jF qc £ to^jVq) do 

16: R' £- {jV r £ R : Score oo} 

17: recurse with jV qc and R' 

18: end for 

19: end if 


combinations are pruned to form the set R' (line 17) by checking the output of the Score () 
function, and then the algorithm recurses with and R'. 

A reference recursion (lines 4-12) is similar to a query recursion, but the pruning strategy 
is significantly more complicated. Given R, we calculate R r , which is the set of nodes in 
R that have scale s™ ax . Then, for each node in the set of children of nodes in R r , the 
node combinations are scored and pruned if possible. Those reference nodes that 

were not pruned form the set Then, this set is combined with R \ R r —that is, each 

of the nodes in R that was not recursed into—to produce R', and the algorithm recurses 
with and the reference set R'. 

The reference recursion only recurses into the top-level subset of the reference nodes in 
order to preserve the separation invariant. It is easy to show that every pair of points held 
in nodes in R is separated by at least 2 s ™: 

Lemma 2 For all nodes jFj £ R (in the context of Algorithm 1) which contain points 
Pi andpj, respectively, d(p l ,p J ) > 2 s ™, with s™ ax defined as in line 3. 

Proof This proof is by induction. If \R\ = 1, such as during the first reference recursion, 
the result obviously holds. Now consider any reference set R and assume the statement of 
the lemma holds for this set R, and define s™ ax as the maximum scale of any node in R. 
Construct the set R r ~i as in line 10 of Algorithm 1; if |72 r _i| < 1, then R r ~i satisfies the 
desired property. 

Otherwise, take any in R r ~i, with points pt and pj, respectively, and scales 

Si and Sj, respectively. Clearly, if Si = Sj = — 1, then by the separation invariant 

d(pi,pj ) > 2 s ™ -1 . 


9 






Now suppose that s* < s™ ax — 1. This implies that there exists some implicit cover tree 
node with point pt and scale s™ ax — 1 (as well an implicit child of this node pi with scale 
s max _ 2 anc j so forth until one of these implicit nodes has child p % with scale s t ). Because 
the separation invariant applies to both implicit and explicit representations of the tree, we 
conclude that d(pi,p r ) > 2 s ™ ax — 1. The same argument may be made for the case where 
Sj < s“ ax — 1, with the same conclusion. 

We may therefore conclude that each point of each node in R r -\ is separated by 2 s ™ ax ~ 1 . 
Note that R' r _i C R r -\ and that R \ R r -\ C R in order to see that this condition holds for 
all nodes in R' = R' r _ 1 U (R \ R r -i). 

Because we have shown that the condition holds for the initial reference set and for any 
reference set produced by a reference recursion, we have shown that the statement of the 
lemma is true. ■ 


This observation means that the set of points P held by all nodes in R is always a subset 
of C s max. This fact will be useful in our later runtime proofs. 

Next, we develop notions with which to understand the behavior of the cover tree dual¬ 
tree traversal when the datasets are of significantly different scale distributions. 

If the datasets are similar in scale distribution (that is, inter-point distances tend to 
follow the same distribution), then the recursion will alternate between query recursions and 
reference recursions. But if the query set contains points which are, in general, much farther 
apart than the reference set, then the recursion will start with many query recursions before 
reaching a reference recursion. The converse case also holds. We are interested in formalizing 
this notion of scale distribution; therefore, define the following dataset-dependent constants 
for the query set S q and the reference set S r : 

• Tj q : the largest pairwise distance in S q 

• 5 q : the smallest nonzero pairwise distance in S q 

• r/ r : the largest pairwise distance in S r 

• 5 r : the smallest nonzero pairwise distance in S r 

These constants are directly related to the aspect ratio of the datasets; indeed, p q /S q is 
exactly the aspect ratio of S q . Further, let us define and bound the top and bottom levels 
of each tree: 

• The top scale s^ of the query tree £? q is such that as [log 2 (??q)l — 1 < s^ < [log 2 (r? 9 )]. 

• The minimum scale of the query tree 2F r is defined as s™ in = |"log 2 (5 g )]. 

• The top scale sj of the reference tree £? r is such that as |"log 2 (?y r )] — 1 < sj < 
riog 2 (r/r)l. 

• The minimum scale of the reference tree 2? r is defined as s“ lin = |dog 2 ((5 r )]. 
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Note that the minimum scale is not the minimum scale of any cover tree node (that 
would be —oo), but the minimum scale of any non-leaf node in the tree. 

Suppose that our datasets are of a similar scale distribution: s^ = sj, and s™ in = 

In this setting we will have alternating query and reference recursions. But if this is not the 
case, then we have extra reference recursions before the first query recursion or after the 
last query recursion (situations where both these cases happen are possible). Motivated by 
this observation, let us quantify these extra reference recursions: 

Lemma 3 For a dual-tree algorithm with S q ~ S r ~ 0(N ) using cover trees and the 
traversal given in Algorithm 1, the number of extra reference recursions that happen before 
the first query recursion is bounded by 

min (O(N),log 2 (r] r /r] q ) - 1). (7) 

Proof The first query recursion happens once s q > s™ ax . The number of reference re¬ 
cursions before the first query recursion is then bounded as the number of levels in the 
reference tree between s)T and s^ that have at least one explicit node. Because there are 
0(N ) nodes in the reference tree, the number of levels cannot be greater than 0(N ) and 
thus the result holds. 

The second bound holds by applying the definitions of sj and sj to the expression 
sj — Sg — 1: 


Sr-Sq- 1 < riog 2 (r?r)l - (riog 2 (r/ ? )l - 1) - 1 (8) 

< log 2 (vr) + 1 - log 2 (? 7 <? ) (9) 

which gives the statement of the lemma after applying logarithmic identities. ■ 

Note that the 0(N) bound may be somewhat loose, but it suffices for our later purposes. 
Now let us consider the other case: 

Lemma 4 For a dual-tree algorithm with S q ~ S r ~ 0(N) using cover trees and the 
traversal given in Algorithm 1, the number of extra reference recursions that happen after 
the last query recursion is bounded by 

6 = max {min [0(N log 2 (S q /d r )), 0(N 2 )] , 0} . (10) 

Proof Our goal here is to count the number of reference recursions after the final query 
recursion at level s™ in ; the first of these reference recursions is at scale s™ ax = -s™ in . Because 
query nodes are not pruned in this traversal, each reference recursion we are counting will 
be duplicated over the whole set of O(N) query nodes. The first part of the bound follows 
by observing that s“ in - < |"log 2 (<5 ? )"| - |"log 2 (<5 r )"| - 1 < log 2 ( 5 q /5 r ). 

The second part follows by simply observing that there are 0(N ) reference nodes. ■ 


These two previous lemmas allow us a better understanding of what happens as the 
reference set and query set become different. Lemma 3 shows that the number of extra 
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recursions caused by a reference set with larger pairwise distances than the query set (j] r 
larger than rj q ) is modest; on the other hand, Lemma 4 shows that for each extra level in 
the reference tree below s™ n , O(N) extra recursions are required. Using these lemmas and 
this intuition, we will prove general runtime bounds for the cover tree traversal. 

Theorem 1 Given a reference set S r of size N with an expansion constant c r and a set 
of queries S q of size 0{N), a standard cover tree based dual-tree algorithm (Algorithm 1) 
takes 


O(4\R*M(N + i t m + 0)) (11) 

time, where |i?*| is the maximum size of the reference set R (line 1) during the dual-tree 
recursion, x the maximum possible runtime of BaseCaseO, if is the maximum possible 
runtime of ScoreO, and 9 is defined as in Lemma f. 

Proof First, split the algorithm into two parts: reference recursions (lines 4-12) and query 
recursions (lines 13-18). The runtime of the algorithm is bounded as the runtime of a 
reference recursion times the total number of reference recursions plus the total runtime of 
all query recursions. 

Consider a reference recursion (lines 4-12). Define R* to be the largest set R for any 
scale s™ ax and any query node .Af during the course of the algorithm; then, it is true that 
\R\ < \R* |. The work done in the base case loop from lines 6-8 is thus 0(\|i?|) < 0(x|i?*|). 
Then, lines 10 and 11 take 0(cfip\R\) < 0(cfif\R*\) time, because each reference node has 
up to c* children. So, one full reference recursion takes 0(cf'tfx\R*\) time. 

Now, note that there are 0(N ) nodes in 3L q . Thus, line 17 is visited O(N) times. 
The amount of work in line 16, like in the reference recursion, is bounded as 0(cfif\R*\). 
Therefore, the total runtime of all query recursions is 0(cfip\R*\N). 

Lastly, we must bound the total number of reference recursions. Reference recursions 
happen in three cases: (1) s™ ax is greater than the scale of the root of the query tree (no 
query recursions have happened yet); (2) s“ ax is less than or equal to the scale of the root 
of the query tree, but is greater than the minimum scale of the query tree that is not —oo; 
(3) s™ ax is less than the minimum scale of the query tree that is not —oo. 

First, consider case (1). Lemma 3 shows that the number of reference recursions of this 
type is bounded by 0(N). Although there is also a bound that depends on the sizes of the 
datasets, we only aim to show a linear runtime bound, so the 0(N) bound is sufficient here. 

Next, consider case (2). In this situation, each query recursion implies at least one 
reference recursion before another query recursion. For some query node ^Vq, the exact 
number of reference recursions before the children of ^V q are recursed into is bounded above 
by ini'Z'Yq) + 1: if ^ q has imbalance 0, then it is exactly one level below its parent, and thus 
there is only one reference recursion. On the other hand, if ^ is many levels below its 
parent, then it is possible that a reference recursion may occur for each level in between; 
this is a maximum of ini^q) + L 

Because each query node in 3L q is recursed into once, the total number of reference 
recursions before each query recursion is 

Y, in(^ q ) + 1 = it(^q) + O(N) (12) 
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since there are 0(N) nodes in the query tree. 

Lastly, for case (3), we may refer to Lemma 4, giving a bound of 6 reference recursions 
in this case. 

We may now combine these results for the runtime of a query recursions with the total 
number of reference recursions in order to give the result of the theorem: 


o {4\R*Wx (N + i t (£r q ) + 6))+0 {4\R*m) ~ O {cf\R*\ifx (N + i t (£T q ) + 0 )). (13) 


When we consider the monochromatic case (where S q = S r ), the results trivially simplify. 

Corollary 1 Given the situation of Theorem 1 but with S q = S r = S so that c q = c r = c 
and 2f q = ^, & dual-tree algorithm using the standard cover tree traversal (Algorithm 

1) takes 


0{c A \R*\x^{N + i t {Sr))) (14) 

time, where |i?*| is the maximum size of the reference set R (line 1) during the dual¬ 
tree recursion, x the maximum possible runtime of BaseCaseO, and if is the maximum 
possible runtime of ScoreO. 

An intuitive understanding of these bounds is best achieved by first considering the 
monochromatic case (this case arises, for instance, in all-nearest-neighbor search). The 
linear dependence on N arises from the fact that all query nodes must be visited. The 
dependence on the reference tree, however, is encapsulated by the term c 4 |i?*|, with \R*\ 
being the maximum size of the reference set R; this value must be derived for each specific 
problem. The bad performance of poorly-behaved datasets with large c (or, in the worst 
case, c ~ N) is then captured in both of those terms. Poorly-behaved datasets may also 
have a high cover tree imbalance it(3?)', the linear dependence of runtime on imbalance is 
thus sensible for well-behaved datasets. 

The bichromatic case (S q ^ S r ) is a slightly more complex result which deserves a bit 
more attention. The intuition for all terms except 6 remain virtually the same. 

The term 6 captures the effect of query and reference datasets with different widths, and 
has one unfortunate corner case: when 5 q > rj r , then the query tree must be entirely de¬ 
scended before any reference recursion. This results in a bound of the form 0{N\og{g r /5 r )), 
or 0(N 2 ) (See Lemma 4). This is because the reference tree must be descended individually 
for each query point. 

The quantity \R*\ bounds the amount of work that needs to be done for each recursion. 
In the worst case, \R*\ can be N. However, dual-tree algorithms rely on branch-and-bound 
techniques to prune away work (Lines 11 and 16 in Algorithm 1). A small value of \R*\ will 
imply that the algorithm is extremely successful in pruning away work. An (upper) bound 
on \R*\ (and the algorithm’s success in pruning work) will depend on the problem and the 
data. As we will show, bounding \R*\ is often possible. For many dual-tree algorithms, 
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x ~ rs«/ 0(1); often, cached sufficient statistics (Moore, 2000) can enable 0(1) runtime 
implementations of BaseCaseO and Score(). 

These results hold for any dual-tree algorithm regardless of the problem. Hence, the 
runtime of any dual-tree algorithm would be at least 0(N ) using our bound, which matches 
the intuition that answering O(N) queries would take at least 0(N) time. For a particular 
problem and data, if c r , |i?*|, %, -0 are bounded by constants independent of N and 9 is no 
more than linear in N (for large enough N), then the dual-tree algorithm for that problem 
has a runtime linear in N. Our theoretical result separates out the problem-dependent and 
the problem-independent elements of the runtime bound, which allows us to simply plug in 
the problem-dependent bounds in order to get runtime bounds for any dual-tree algorithm 
without requiring an analysis from scratch. 

Our results are similar to that of Ram et al. (2009a), but those results depend on a 
quantity called the constant of bichromaticity , denoted n. which has unclear relation to 
cover tree imbalance. The dependence on k is given as c^ K , which is not a good bound, 
especially because k may be much greater than 1 in the bichromatic case (where S q S r ). 

The more recent results of Curtin and Ram (201 ) are more related to these results, 
but they depend on the inverse constant of bichromaticity v which suffers from the same 
problem as k. Although the dependence on v is linear (that is, 0(yN )), bounding v is 
difficult and it is not true that v = 1 in the monochromatic case. 

v corresponds to the maximum number of reference recursions between a single query 
recursion, and k corresponds to the maximum number of query recursions between a single 
reference recursion. The respective proofs that use these constants then apply them as a 
worst-case measure for the whole algorithm: when using k, Ram el (2009a) assume that 
every reference recursion may be followed by k query recursions; similarly, Curt in and Ram 
(2011) assume that every query recursion may be followed by v reference recursions. Here, 
we have simply used it(^q) and 9 as an exact summation of the total extra reference 
recursions, which gives us a much tighter bound than v or k on the running time of the 
whole algorithm. 

Further, both v and k are difficult to empirically calculate and require an entire run of 
the dual-tree algorithm. On the other hand, bounding it(^q) (and 9) can be done in one 
pass of the tree (assuming the tree is already built). Thus, not only is our bound tighter 
when the cover tree imbalance is sublinear in N, it more closely reflects the actual behavior 
of dual-tree algorithms, and the constants which it depends upon are straightforward to 
calculate. 

In the following sections, we will apply our results to specific problems and show the 
utility of our bound in simplifying runtime proofs for dual-tree algorithms. 

5. Nearest neighbor search 

The standard task of nearest neighbor search can be simply described: given a query 
set S q and a reference set S r , for each query point p q £ S qi find the nearest neighbor p r 
in the reference set S r . The task is well-studied and well-known, and there exist numerous 
approaches for both exact and approximate nearest neighbor search, including the cover 
tree nearest neighbor search algorithm due to Beygelzimer et al. (2006). We will consider 
that algorithm, but in a tree-independent sense as given by Curtin et al (2013b); this 


14 


Algorithm 2 Nearest neighbor search BaseCaseO 

Input: query point p q , reference point p r , list of candidate neighbors N and distances D 

Output: distance d between p q and p r 

if d(p q ,p r ) < D[p q \ and BaseCase(p g , p r ) not yet called then 
D[p q ] <r~ d(p q ,p r ), and N\p q \ p r 

end if 

return d(p q ,p r ) 


Algorithm 3 Nearest neighbor search Score() 

Input: query node ^Y q , reference node jV r 

Output: a score for the node combination ('A q ,jY r ), or oo if the combination should be 
pruned 

if d m\n (^K v < B{jy r q ) then 
ret urn d m ; n ( jV q , jV r ) 

end if 
return oo 


means that to describe the algorithm, we require only a BaseCaseO and Score () function; 
these are given in Algorithms 2 and 3, respectively. The point-to-point BaseCase () function 
compares a query point p q and a reference point p r , updating the list of candidate neighbors 
for p q if necessary. 

The node-to-node Score () function determines if the entire subtree of nodes under the 
reference node can improve the candidate neighbors for all descendant points of the 
query node ^Y q \ if not, the node combination is pruned. The Score () function depends on 
the function d m ; n (-, •), which represents the minimum possible distance between any two 
descendants of two nodes. Its definition for cover tree nodes is 

dmini^q^r) = d{ Pq ,p r ) ~ 2 Sq + 1 - 2^ + 1 . (15) 

Given a type of tree and traversal, these two functions store the current nearest neighbor 
candidates in the array N and their distances in the array D. (See Curtin et ah, 2013b, 
for a more complete discussion of how this algorithm works and a proof of correctness.) 
The Score () function depends on a bound function B(jV q ) which represents the maximum 
distance that could possibly improve a nearest neighbor candidate for any descendant point 
of the query node ^V q . The standard bound function B(^V q ) used for cover trees is adapted 
from (Beygelzimer et ah, 2006): 


B{JQ) := D[p q ] + 2 S « +1 (16) 

In this formulation, the query node jV q holds the the query point p q , the quantity 
D[p q \ is the current nearest neighbor candidate distance for the query point p q , and 2 Sq+l 
corresponds to the furthest descendant distance of ./f). For notational convenience in the 
following proof, take c qr = max((ma x Pq£ s q c' r ), c r ), where c' r is the expansion constant of 
the set S r U {p q }- 
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Theorem 2 Using cover trees, the standard cover tree pruning dual-tree traversal, and the 
nearest neighbor search BaseCaseO and Score () as given in Algorithms 2 and 3, respec¬ 
tively, and also given a reference set S r with expansion constant c r , and a query set S q , 
the running time of the algorithm is bounded by 0(c^Cg r (N + it(^q) + #)) with it(3F q and 6 
defined as in Definition 3 and Lemma 4, respectively. 

Proof The running time of BaseCaseO and ScoreO are clearly 0(1). Due to Theorem 1, 
we therefore know that the runtime of the algorithm is bounded by 0(cf\R*\(N -\-it(47q)-{-9)). 
Thus, the only thing that remains is to bound the maximum size of the reference set, \R*\. 

Assume that when R* is encountered, the maximum reference scale is and the 
query node is PU q . Every node £ R* satisfies the property enforced in line 11 that 
rf min Mq. JK) < B[uU q ). Using the definition of d m i n (', ■) and B(-), we expand the equation. 
Note that p q is the point held in ^V q and p r is the point held in OO Also, take p r to be the 


current nearest neighbor candidate for 

Pq] that is, D[p q ] = d(p q ,Pr) and N[p q \ 

= p r . Then, 


< 

B(-yT q ) 

(17) 

d(Pq,Pr) 

< 

d{p q ,Pr ) + 2 S 1 +1 + 2 Sr+1 + 2 Sq+1 

(18) 


< 

d{Pq,Pr) + 2(2 S “ aX+1 ) 

(19) 


where the last step follows because s q + 1 < s“ ax and s r < s™ ax . Define the set of points 
P as the points held in each node in R* (that is, P = {p r £ 3 d Y r £ R*})- Then, we 

can write 


p c B Sr (p q ,d(p q ,p r ) + 2(2 s ” ax+1 )). (20) 

Suppose that the true nearest neighbor is p* and d(p q ,p*) > 2 s ™ ax+1 . Then, p* must 
be held as a descendant point of some node in R* which holds some point p r . Using the 
triangle inequality, 


d(p q ,Pr ) < d(p q ,pr) < d(p q ,p *) + d{p r ,p* r ) < d(p q ,p* r ) + 2 s ” ax+1 . (21) 

This gives that P C B SrU r p }(p q ,d(p q ,p*) + 3(2 s “ ax+1 )). The previous step is necessary: 
to apply the definition of the expansion constant, the ball must be centered at a point in 
the set; now, the center (jp q ) is part of the set. 

\B SrU { Pq} (p q ,d(p q ,p* r )+ 3{2 S ™* +1 ))\ < \B SrU{pq} (p q ,M{p q ,p* r ))\ (22) 

< c 3 q r\ B SrU{p q }(Pq,d(p q ,p*)/2)\ (23) 

which follows because the expansion constant of the set S r U {p q } is bounded above by c qr . 
Next, we know that p* is the closest point to p q in S r U { p q }; thus, there cannot exist a 
point p' r / p q £ S r U {p q } such that p' r £ Bs qr (p q , d(p q ,p*)/ 2) because that would imply 
that d{p q ,p' r ) < d(p q , p* ), which is a contradiction. Thus, the only point in the ball is p q , 
and we have that \B SrU ^ Pq y(p q , d(p q ,p*)/2)\ = 1, giving the result that \R\ < c qr in this 
case. 
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The other case is when d(p q ,p*) < 2 S ™ +1 , which means that d(p q ,p r ) < 2 S ™ +2 . Note 
that P G C s max, and therefore 


P C 
c 


BSri'Pqi d(p q ,p*) + 3(2 S ” aX + 1 )) n C s max 

BSriPqi 4(2 Sr +1 ))nC' s max. 


(24) 

(25) 


Every point in C^max is separated by at least 2 s ™. Using Lemma 1 with <5 = 2 s ™ and 


p = 8 yields that \P\ < tf. 


This gives the result, because c® < Cg r . 


In the monochromatic case where S q = S r , the bound is 0(c 9 (N + it(^P)) because 
c = c r = c qr and 9 = 0. For well-behaved trees where it.(^q) is linear or sublinear in N, this 
represents the current tightest worst-case runtime bound for nearest neighbor search. 


6. Approximate kernel density estimation 

Ram et al. (2009a) present a clever technique for bounding the running time of approximate 
kernel density estimation based on the properties of the kernel, when the kernel is shift- 
invariant and satisfies a few assumptions. We will restate these assumptions and provide 
an adapted proof using Theorem 1, which gives a tighter bound. 

Approximate kernel density estimation is a common application of dual-tree algorithms 
(Gray and Moore, 2003, 2001). Given a query set S q , a reference set S r of size N, and a 
kernel function /C(-, •), the true kernel density estimate for a query point p q is given as 

f*(Pq) = ^ £(Pq,Pr)- (26) 

Pr^Sr 

In the case of an infinite-tailed kernel /C(-, •), the exact computation cannot be accel¬ 
erated; thus, attention has turned towards tractable approximation schemes. Two simple 
schemes for the approximation of f*(p q ) are well-known: absolute value approximation and 
relative value approximation. Absolute value approximation requires that each density es¬ 
timate f(p q ) is within e of the true estimate f*(p q ): 


I f{Pq) - f*(p q ) | < e Vp q G S q . (27) 

Relative value approximation is a more flexible approximation scheme; given some pa¬ 
rameter e, the requirement is that each density estimate is within a relative tolerance of 
f*(Pq) ■■ 


\f(Pq) - f*(Pq) I 

\f*(Pq)\ 


< e Mp q G Sq. 


(28) 


Kernel density estimation is related to the well-studied problem of kernel summation, 
which can also be solved with dual-tree algorithms (Lee and Gray, 2006, 2009). In both of 
those problems, regardless of the approximation scheme, simple geometric observations can 
be made to accelerate computation: when JC(-, •) is shift-invariant, faraway points have very 
small kernel evaluations. Thus, trees can be built on S q and S r , and node combinations can 
be pruned when the nodes are far apart while still obeying the error bounds. 
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Algorithm 4 Approximate kernel density estimation BaseCaseQ 

1: Input: query point p q , reference point p r , list of kernel point estimates f p 
2: Output: kernel value K,{p q ,p r ) 

3 : fp{Pq ) ^ fpiPq ) H - ^(PqiPr ) 

4: return JC(p q ,p r ) 


In the following two subsections, we will separately consider both the absolute value 
approximation scheme and the relative value approximation scheme, under the assumption 
of a shift-invariant kernel JC(p q ,p r ) = IC(\\p q — p r \\) which is monotonically decreasing and 
non-negative. In addition, we assume that there exists some bandwidth h such that /C(d) 
must be concave for d € [0, h\ and convex for d € [h, oo). This assumption implies that 
the magnitude of the derivative \JC(d)\ is maximized at d = h. These are not restrictive 
assumptions; most standard kernels fall into this class, including the Gaussian, exponential, 
and Epanechnikov kernels. 

6.1 Absolute value approximation 

A tree-independent algorithm for solving approximate kernel density estimation with ab¬ 
solute value approximation under the previous assumptions on the kernel is given as a 
BaseCaseO function in Algorithm 4 and a Score () function in Algorithm 5 (a correctness 
proof can be found in Curtin ct al., 2013b). The list f p holds partial kernel density estimates 
for each query point, and the list f n holds partial kernel density estimates for each query 
node. At the beginning of the dual-tree traversal, the lists f p and f n , which are both of size 
O(N), are each initialized to 0. As the traversal proceeds, node combinations are pruned if 
the difference between the maximum kernel value /C(cZ m i n (A^, Or)) and the minimum kernel 
value /C(d max (A^, Of)) is sufficiently small (line 3). If the node combination can be pruned, 
then the partial node estimate is updated (line 4). When node combinations cannot be 
pruned, BaseCaseO may be called, which simply updates the partial point estimate with 
the exact kernel evaluation (line 3). 

After the dual-tree traversal, the actual kernel density estimates / must be extracted. 
This can be done by traversing the query tree and calculating f(p q ) = fp(Pq)+J2, ViCT fn(Of), 
where T is the set of nodes in TT q that have p q as a descendant. Each query node needs to 
be visited only once to perform this calculation; it may therefore be accomplished in 0(N ) 
time. 

Note that this version is far simpler than other dual-tree algorithms that have been pro¬ 
posed for approximate kernel density estimation (see, for instance Gray and Moore, 2003); 
however, this version is sufficient for our runtime analysis. Real-world implementations, 
such as the one found in mlpack (Curtin ct al., 2013a), tend to be far more complex. 

Theorem 3 Assume that /C(-, •) is a kernel satisfying the assumptions of the previous sub¬ 
section. Then, given a query set S q and a reference set S r with expansion constant c r , 
and using the approximate kernel density estimation BaseCaseO and Score () as given 
in Algorithms 4 an d 5, respectively, with the traversal given in Algorithm 1, the running 
time of approximate kernel density estimation for some error parameter e is bounded by 
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Algorithm 5 Absolute-value approximate kernel density estimation Score () 

1: Input: query node DV qi reference node list of node kernel estimates f n 
2: Output: a score for the node combination (A^,^), or oo if the combination should 
be pruned 

3: if <^r)) ^ € then 

4: /„(/,) <- /„(/,) + |^ P (^)| (/C(d min (^, ^)) + /C(d max (^, ^))) / 2 

5: return oo 

6 : end if 

7: return /C(d m ; n (Aq, Dfi)) Al(d max (t/)q, 


0(cr + ^° S2 ^ (A + it(^q) + ^)) with ( = —JC'(h)lC 1 (e)e 1 , it{^q) defined as in Definition 3, 
and 9 defined as in Lemma 4- 

Proof It is clear that BaseCaseQ and ScoreO both take 0(1) time, so Theorem 1 implies 
the total runtime of the dual-tree algorithm is bounded by 0(c^.\R*\(N + i t (^ q ) + 6)). Thus, 
we will bound \R*\ using techniques related to those used by Ram et al. (2009a). The 
bounding of \R*\ is split into two sections: first, we show that when the scale s™ ax is small 
enough, R* is empty. Second, we bound R* when is larger. 

The ScoreO function is such that any node in R* for a given query node V q obeys 

^C(dmin K, A-)) - /C(d max (0^, JT r )) > e. (29) 

Thus, we are interested in the maximum possible value JC(a) — 1C(b) for a fixed value 
of b — a > 0. Due to our assumptions, the maximum value of IC'(-) is K,'{h)\ therefore, the 
maximum possible value of /C(a) — /C(6) is when the interval [a, b] is centered on h. This 
allows us to say that /C(o) — lC{b) < e when (b — a) < (— e/K,'(h)). Note that 

dmaxMq,^r) ~ d min (^,^) < d(p q ,p r ) + 2 S “ aX+1 - d(p q ,p r ) + 2 s ™* +1 (30) 

< 2 s ™ ax+2 . (31) 

Therefore, R* = 0 when 2 s ™ ax+2 < —e/K'{h), or when < log 2 (—e//C'(h)) — 2. 
Consider, then, the case when s™ ax > log 2 (—e//C'(/i)) — 2. Because of the pruning rule, 
for any jV r £ R*, /C(d m i n (A^, dK)) > e; we may refactor this by applying definitions to 
show d(p q ,p r ) < /C -1 (e) + 2 Sr +1 . Therefore, bounding the number of points in the set 
Bs r (pq, lC~ 1 (e) + 2 Sr +1 ) fl C s max is sufficient to bound \R*\. For notational convenience, 
define to = (/C~ 1 (e)/2 Sr +1 ) + 1, and the statement may be more concisely written as 

B Sr (p q ,to2 s r^)nC s? ^. 

Using Lemma 1 with 5 = 2 s ™ and p = 2to gives \R*\ = Cr + ^ loS2 ^. 

The value to is maximized when s™ ax is minimized. Using the lower bound on s™ ax , 
to is bounded as to = —2/C / (/i)/C~ 1 (e)e^ 1 . Finally, with ( = —/C'(h)/C~ 1 (e)e~ 1 , we are able 
to conclude that \R*\ < c ^+r iog 2 ( 2 ?)l _ c ^ + r ic> g 2 Cl _ yp]^ iere f ore _ the entire dual-tree traversal 
takes 0(cr + ^ 1 ° S2 ^(A r + 0)) time. 

The postprocessing step to extract the estimates /(•) requires one traversal of the tree 
£L r \ the tree has 0(N) nodes, so this takes only 0(N) time. This is less than the runtime of 
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the dual-tree traversal, so the runtime of the dual-tree traversal dominates the algorithm’s 
runtime, and the theorem holds. ■ 


The dependence on e (through £) is expected: as e — > 0 and the search becomes exact, £ 
diverges both because e^ 1 diverges and also because /C -1 (e) diverges, and the runtime goes 
to the worst-case 0(N 2 ); exact kernel density estimation means no nodes can be pruned at 
all. 

For the Gaussian kernel with bandwidth a defined by KL g {d) = exp(— d 2 /(2a 2 )), £ does 
not depend on the kernel bandwidth; only the approximation parameter e. For this kernel, 
h = a and therefore —JC g (h) = a~ 1 e^ 1//2 . Additionally, JC g 1 (e) = a^/2\n(l/e). This means 
that for the Gaussian kernel, £ = \J (—21ne)/(ee 2 ). Again, as e —>• 0, the runtime diverges; 
however, note that there is no dependence on the kernel bandwidth a. To demonstrate 
the relationship of runtime to e, see that for a reasonably chosen e = 0.05, the runtime is 
approximately 0(c® ,89 (iV + 9))] for e = 0.01, the runtime is approximately 0(cl 152 (N + 0)). 
For very small e = 0.00001, the runtime is approximately 0(c 2215 (N + 9)). 

Next, consider the exponential kernel: JCi(d) = exp(— d/a). For this kernel, h = 0 (that 
is, the kernel is always convex), so then K,[{h) = cr _1 . Simple algebraic manipulation gives 
/Cj (e) = —a lne, resulting in £ = —/C((h)/C£ 1 (e)e~ 1 = e _1 lne. So both the exponential 
and Gaussian kernels do not exhibit dependence on the bandwidth. 

To understand the lack of dependence on kernel bandwidth more intuitively, consider 
that as the kernel bandwidth increases, two things happen: (a) the reference set R becomes 
empty at larger scales, and (b) /C _1 (e) grows, allowing less pruning at higher levels. These 
effects are opposite, and for the Gaussian and exponential kernels they cancel each other 
out, giving the same bound regardless of bandwidth. 

6.2 Relative Value Approximation 

Approximate kernel density estimation using relative-value approximation may be bounded 
by reducing the absolute-value approximation algorithm (in linear time or less) to relative- 
value approximation. This is the same strategy as performed by Ram et al. (2009a). 

First, we must establish a Score () function for relative value approximation. The 
difference between Equations 27 and 28 is the division by the term \f*(pq)\- But we can 
quickly bound \f*{pq)\- 


I f*(Pq)\ > NIC ( max d(p q ,p r )] . (32) 

This is clearly true: each point in S r must contribute more than IC(max Pr ^s r d(p q ,p r )) 
to /* (jPq)■ Now, we may revise the relative approximation condition in Equation 28: 

|/(^)-r(p 9 )|<6/C max (33) 

where /C max is lower bounded by /C(ma x PrG g r d(p q ,p r )). Assuming we have some estimate 
/C max , this allows us to create a Score() algorithm, given in Algorithm 6. 

Using this, we may prove linear runtime bounds for relative value approximate kernel 
density estimation. 
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Algorithm 6 Relative-value approximate kernel density estimation Score () 

1: Input: query node jV q , reference node JV r , list of node kernel estimates f n 
2: Output: a score for the node combination (Mg,M£), or oo if the combination should 
be pruned 

3: if /C(cZ m i n (M^)) - lC(d max (^ q ,'SK)) < e/C max then 

4: /„(/,) <- /„(/,) + \® P {<A)\ (/C(d min K,^)) + /C(d max (^,^))) / 2 

5: return oo 

6 : end if 

7: return A(d m i n (Mg, M).)) Ai(<i max (Mq, ^r)) 


Theorem 4 Assume that /C(-, •) is a kernel satisfying the same assumptions as Theorem 3. 
Then, given a query set S q and a reference set S r both of size O(N), it is possible to perform 
relative value approximate kernel density estimation (satisfying the condition of Equation 
28) in 0{N) time, assuming that the expansion constant c r of S r is not dependent on N. 

Proof It is easy to see that Theorem 3 may be adapted to the very slightly different Score () 
rule of Algorithm 6 while still providing an O(N) bound. With that Score () function, the 
dual-tree algorithm will return relative-value approximate kernel density estimates satisfying 
Equation 28. 

We now turn to the calculation of /C max . Given the cover trees ST q and £T r with root 
nodes jV r R and ^Af R , respectively, we may calculate a suitable /C max value in constant time: 


j^max 


<fJf R ,jV R ) = d(p, 


£,p?) 


+ 2 s 


s +i 


+ 2 s 


s +i 


This proves the statement of the theorem. 


(34) 


In this case, we have not shown tighter bounds because the algorithm we have proposed 
is not useful in practice. For an example of a better relative-value approximate kernel 
density estimation dual-tree algorithm, see the work of bray and Moore (2003). 

7. Range search and range count 

In the range search problem, the task is to find the set of reference points 

S\p q ] = {Pr e S r : d(p q ,p r ) G [l,u]} (35) 

for each query point p q , where [l,u\ is the given range. The range count problem is practi¬ 
cally identical, but only the size of the set, |5[p 9 ]|, is desired. Our proof works for both of 
these algorithms similarly, but we will focus on range search. A BaseCaseQ and Score () 
function are given in Algorithms 7 and 8, respectively (a correctness proof can be found in 
Curtin et ah, 2013b). The sets N(p q \ (for each p q ) are initialized to 0 at the beginning of 
the traversal. 

In order to bound the running time of dual-tree range search, we require better notions 
for understanding the difficulty of the problem. Observe that if the range is sufficiently 
large, then for every query point p q , S\p q } = S r . Clearly, for S q ~ S r ~ O(N), this cannot 
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Algorithm 7 Range search BaseCaseO 
1: Input: query point p q , reference point p r , range sets N\p q ] and range [l,u\ 
2: Output: distance d between p q and p r 

3: if d(p q ,p r ) G [r m i n ,r max ] and BaseCase(p g , p r ) not yet called then 
4: s\p q ] G- s\p q \ U {p r } 

5: end if 
6: return d 


Algorithm 8 Range search Score() 

1: Input: query node jV qi reference node jF r 

2: Output: a score for the node combination (Oq, jVf), or oo if the combination should 
be pruned 

3: if ) G [Z, u\ or d max (Ai), tA£) G [/, w] then 

4: return d m j n (,A^, ,A^.) 

5: end if 
6: return oo 


be solved in anything less than quadratic time simply due to the time required to fill each 
output array S\p q ]. Define the maximum result size for a given query set S q , reference set 
S r , and range [Z, it] as 


I-S'maxl = max \S\p q \l. (36) 

Pq&Sq 

Small |S max | implies an easy problem; large | Sma. Y | implies a difficult problem. For 
bounding the running time of range search, we require one more notion of difficulty, related 
to how IS'maxI changes due to changes in the range [l,u]. 

Definition 4 For a range search problem with query set S q , reference set S r , range [l,u ], 
and results S\p q ] for each query point p q given as 

S [Pq\ = {Pr - Pr e Sr,l < d(jp q ,p r ) < u}, (37) 

define the a-expansion of the range set S\p q ] as the slightly larger set 

S a \Pq\ = {pr ■ Pr € S r , (1 - a)l < d{p q ,p r ) < (1 + a)u}. (38) 

When the a-expansion of the set S max is approximately the same size as S max , then the 
problem would not be significantly more difficult if the range [l,u\ was increased slightly. 
Using these notions, then, we may now bound the running time of range search. 

Theorem 5 Given a reference set S r of size N with expansion constant c r , and a query 
set S q of size O(N), a search range of [l,u], and using the range search BaseCaseO and 
Score () as given in Algorithms 7 and 8, respectively, with the standard cover tree pruning 
dual-tree traversal as given in Algorithm 1, and also assuming that for some a > 0, 

\S a \p q ]\S\p q }\<C Vp q eS q , (39) 
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the running time of range search or range count is bounded by 


O (4 max ( 4 +0 , |S max | + c) (N + i t (^ q ) + 0)) (40) 

with 9 defined as in Lemma J h fi = |"log 2 (l + a -1 )], and 5 max as defined in Equation 
36. 

Proof Both BaseCaseQ (Algorithm 7) and Score () (Algorithm 8) take 0(1) time. 
Therefore, using Lemma 1, we know that the runtime of the algorithm is bounded by 
0(cf\R*\(N + + 9)). As with the previous proofs, then, our only task is to bound 

the maximum size of the reference set, \R*\. 

By the pruning rule, for a query node jV q , the reference set R* is made up of reference 
nodes jV r that are within a margin of 2 Sq+1 + 2 Sr+1 < 2 s ™ ax+2 of the range [l, u\. Given that 
p r is the point in jV r , 

Pr € ( B Sr (p q ,U + 2 S ™ aX+2 ) n C^ax) \ (. B Sr (p q ,l- 2°™*+*) f| O s max) . (41) 

A bound on the number of elements in this set is a bound on \R*\. First, consider the 
case where u < a _1 2 s ™ ax+2 . Ignoring the smaller ball, take 5 = 2 s ™ ax and p = 4(1 + a -1 ) 
and apply Lemma 1 to produce the bound 

\R*\ < c 4+ ri °g 2 (1 +^ 1) l > (42) 

Now, consider the other case: u > a _1 2 s ™ ax+1 . This means 


Bs r (p q ,u + 2 s ” ax+1 ) \ B Sr {p q ,l- 2 s ” ax+1 ) C B Sr {p q , (1 + a)u) \ B Sr (pq, (1 - a)l). (43) 

This set is necessarily a subset of S a \p q ]; by assumption, the number of points in this 
set is bounded above by |5 Vna X | + C. We may then conclude that |i?*| < l^ ma^ l + C. By 
taking the maximum of the sizes of \R*\ in both cases above, we obtain the statement of 
the theorem. ■ 

This bound displays both the expected dependence on c r and |5 max |. As the largest range 
set S max increases in size (with the worst case being Sm ax ~ N), the runtime degenerates 
to quadratic. But for adequately small S'max the runtime is instead dependent on c r and 
the parameter C of the a-expansion of 5 max . This situation leads to a simplification. 

Corollary 2 For sufficiently small IS'maxI and sufficiently small C, the runtime of range 
search under the conditions of Theorem 5 simplifies to 

0{cl + d{N + i t fiV q ) + 9)). (44) 


In this setting we can more easily consider the relation of the running time to a. Consider 
a = (1/3); this yields a running time of 0(c 8 (N+9)). a = (1/7) yields 0(c 9 (N+i t (^4f } +9)), 
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a = (1/15) yields O(c 10 (N + it{PU q ) + 8)), and so forth. As a gets smaller, the exponent on 
c gets larger, and diverges as a —> 0. 

For reasonable runtime it is necessary that the a-expansion of Smax be bounded. This is 
because the dual-tree recursion must retain reference nodes which may contain descendants 
in the range set S\p q \ for some query p q . The parameter C of the a-expansion allows us to 
bound the number of reference nodes of this type, and if a increases but C remains small 
enough that Corollary 2 applies, then we are able to obtain tighter running bounds. 

8. Conclusion 

We have presented a unified framework for bounding the runtimes of dual-tree algorithms 
that use cover trees and the standard cover tree pruning dual-tree traversal (Algorithm 1). 
In order to produce an understandable bound, we have introduced the notion of cover tree 
imbalance; one possible interesting direction of future work is to empirically and theoreti¬ 
cally minimize this quantity by way of modified tree construction algorithms; this is likely 
to provide both tighter runtime bounds and also accelerated empirical results. 

Our main result, Theorem 1, allows plug-and-play runtime bounding of these algorithms. 
We have shown that Theorem 1 is useful for bounding the runtime of nearest neighbor search 
(Theorem 2), approximate kernel density estimation (Theorem 3), exact range count, and 
exact range search (Theorem 5). With our contribution, bounding a cover tree dual-tree 
algorithm is straightforward and only involves bounding the maximum size of the reference 
set, \R*\- 
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